Random Forest DBSCAN for USPTO Inventor Name Disambiguation
Abstract
Name disambiguation and the subsequent name conflation are essential for correctly processing person name queries in a digital library or other database. Disambiguation distinguishes each unique person from all other records in the database. We study inventor name disambiguation for a patent database using methods and features from earlier work on author name disambiguation, and we propose a feature set appropriate for a patent database. A random forest was selected as the pairwise linking classifier because it outperformed Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Conditional Inference Tree, and Decision Trees. Block size, which is very important for scaling, was selected based on experiments that determined feature importance and accuracy. The DBSCAN algorithm is used to cluster records, with a distance function derived from the random forest classifier. For additional scalability, clustering was parallelized. Tests on the USPTO patent database show that our method successfully disambiguated 12 million inventor mentions in 6.5 hours. Evaluation on datasets from the USPTO PatentsView inventor name disambiguation competition shows that our algorithm outperformed all algorithms in the competition.
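The pipeline described in the abstract (blocking, a pairwise match score, then DBSCAN over a distance derived from that score) can be illustrated with a minimal standard-library sketch. Everything specific here is an assumption made for illustration: the sample mentions, the `block_key` choice (last name plus first initial), the `eps` and `min_pts` values, and above all `match_probability`, which is a hand-written stand-in for the trained random forest and not the paper's classifier or feature set.

```python
from collections import defaultdict

# Hypothetical inventor mentions: (last name, first name, city).
MENTIONS = [
    ("smith", "john", "austin"),
    ("smith", "jon", "austin"),
    ("smith", "jane", "boston"),
    ("smith", "john", "austin"),
    ("lee", "david", "seattle"),
    ("lee", "david", "seattle"),
]


def block_key(mention):
    """Blocking: only mentions sharing this (assumed) key are ever compared."""
    last, first, _city = mention
    return (last, first[0])


def match_probability(a, b):
    """Stand-in for the pairwise classifier's P(same inventor). NOT a random forest."""
    score = 0.0
    if a[1] == b[1]:
        score += 0.5          # identical first name
    elif a[1][0] == b[1][0]:
        score += 0.3          # same first initial only
    if a[2] == b[2]:
        score += 0.5          # same city
    return score


def dbscan(n, dist, eps, min_pts):
    """Minimal DBSCAN over a distance function on indices 0..n-1; -1 marks noise."""
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist(i, j) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1    # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist(j, k) <= eps]
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is a core point: keep expanding
    return labels


def disambiguate(mentions, eps=0.5, min_pts=2):
    """Assign an inventor ID to each mention; noise points become singletons."""
    blocks = defaultdict(list)
    for idx, mention in enumerate(mentions):
        blocks[block_key(mention)].append(idx)

    inventor_ids = [None] * len(mentions)
    next_id = 0
    for members in blocks.values():
        def dist(i, j):
            # Distance derived from the pairwise score: 1 - P(same inventor).
            return 1.0 - match_probability(mentions[members[i]], mentions[members[j]])

        labels = dbscan(len(members), dist, eps, min_pts)
        local_to_global = {}
        for pos, label in enumerate(labels):
            if label == -1 or label not in local_to_global:
                assigned = next_id
                next_id += 1
                if label != -1:
                    local_to_global[label] = assigned
            else:
                assigned = local_to_global[label]
            inventor_ids[members[pos]] = assigned
    return inventor_ids


if __name__ == "__main__":
    for mention, inventor in zip(MENTIONS, disambiguate(MENTIONS)):
        print(inventor, mention)
```

Because each block is clustered independently, the loop over `blocks.values()` is the natural place to parallelize, which is consistent with the abstract's remark that clustering was parallelized; how the authors actually distributed the work is not stated here.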
Similar resources
Disambiguation of patent inventors and assignees using high-resolution geolocation data
Patent data represent a significant source of information on innovation, knowledge production, and the evolution of technology through networks of citations, co-invention and co-assignment. A major obstacle to extracting useful information from this data is the problem of name disambiguation: linking alternate spellings of individuals or institutions to a single identifier to uniquely determine...
Fast Author Name Disambiguation in CiteSeer
Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative machine learning framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, D...
Efficient Name Disambiguation for Large-Scale Databases
Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters p...
Person Name Disambiguation based on Topic Model
In this paper we describe our participation in the SIGHAN 2010 Task3 (Person Name Disambiguation) and detail our approaches. Person Name Disambiguation is typically viewed as an unsupervised clustering problem where the aim is to partition a name’s contexts into different clusters, each representing a real-world person. The key point of clustering is the similarity measure of context, which dep...
UBC Entity Discovery and Linking & Diagnostic Entity Linking
This paper describes the runs submitted by the UBC team at TAC-KBP 2014 for both the English Entity Discovery and Linking (EDL) and Diagnostic Entity Linking (DEL) tasks. Our main interest was to compare the performance of two totally different named entity recognizer systems and to combine them with three different named entity disambiguation systems that were developed for the TAC-KBP 2013 EL ta...
Journal title:
- CoRR
Volume abs/1602.01792
Publication date 2016